Chapter 4: Boston Housing Data Analysis

In this chapter, we explore the Boston Housing dataset, which comprises attributes of suburbs in the Boston, Massachusetts area as recorded in the 1970s. It is a rich dataset often used to study the housing market with statistical learning techniques. As per the assignment, I begin by loading the data and examining its structure, highlighting key variables like crime rates, property tax rates, and median home values. Next, I provide a visual and statistical summary, discussing each variable's distribution and the relationships between variables. Then I standardize the data to prepare for more complex analyses, including clustering and linear discriminant analysis (LDA), which reveal insights into the socio-economic patterns affecting housing values.

Through this assignment, I learned new statistical learning techniques: I gained insight into housing market patterns by performing exploratory data analysis, standardization, clustering, and discriminant analysis, and I further sharpened my data visualization skills.

date()
## [1] "Mon Nov 20 17:50:47 2023"
# Loading and Exploring the Boston Dataset
library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

## Graphical Overview of the Data

Next, I use ‘pairs’ to create pairwise scatter plots for an overview of the relationships between variables, and ‘summary’ to produce a statistical summary of each variable.

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

### Distributions and Relationships of the Variables

The pairwise scatter plots reveal several notable relationships:

  • rm and medv: A positive correlation suggests that suburbs with more rooms tend to have higher median home values.
  • lstat and medv: A visible negative correlation implies that suburbs with a higher percentage of lower-status population have lower home values.
  • nox and indus: A positive correlation indicates that more industrial areas have higher nitric oxide concentrations.
  • dis and nox: A negative correlation suggests that areas further from employment centers have lower nitric oxide concentrations.
  • age and nox: There seems to be a trend where older houses are in areas with higher nitric oxide concentrations.
  • rad and tax: A high correlation indicates that suburbs with better highway access tend to have higher tax rates.
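
These eyeballed relationships can be checked numerically with a correlation matrix (a quick follow-up check, not part of the original assignment):

```r
library(MASS)
data(Boston)

# Correlations among the variable pairs discussed above; the signs should
# match the bullet points (e.g. rad-tax strongly positive, lstat-medv negative)
vars <- c("rm", "lstat", "medv", "nox", "indus", "dis", "age", "rad", "tax")
round(cor(Boston[, vars]), 2)
```

The strongest associations in this subset are the positive rad–tax correlation and the negative lstat–medv and dis–nox correlations, consistent with the visual impression from the pairs plot.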

Standardization and Categorical Variable Creation

# Installing and loading the caret package
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
# Standardizing the dataset
scaled_Boston <- scale(Boston)

# Printing out summaries of the scaled data
summary(scaled_Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Creating a categorical variable of the crime rate using quantiles
Boston$crime_cat <- cut(Boston$crim, breaks=quantile(Boston$crim, probs=seq(0, 1, by=0.25)), include.lowest=TRUE)

# Dropping the old crime rate variable from the dataset
Boston <- Boston[,-which(names(Boston) == "crim")]

# Dividing the dataset into train and test sets (80% train, 20% test)
trainIndex <- createDataPartition(Boston$crime_cat, p = .8, list = FALSE, times = 1)
train_set <- Boston[trainIndex, ]
test_set <- Boston[-trainIndex, ]
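
As a sanity check on the split above: the quartile construction should give four roughly equal classes, and stratified sampling should preserve those proportions in the train set. Here is a base-R sketch of the same idea (my addition; `createDataPartition()` in the original performs the same stratification, and the seed here is arbitrary since the original set none):

```r
library(MASS)
data(Boston)
Boston$crime_cat <- cut(Boston$crim,
                        breaks = quantile(Boston$crim, probs = seq(0, 1, by = 0.25)),
                        include.lowest = TRUE)
# Quartile-based bins each hold roughly a quarter of the 506 suburbs
table(Boston$crime_cat)

# Base-R stratified 80/20 split: sample 80% of the rows within each crime category
set.seed(42)  # arbitrary seed, for reproducibility only
train_idx <- unlist(lapply(split(seq_len(nrow(Boston)), Boston$crime_cat),
                           function(i) sample(i, size = floor(0.8 * length(i)))))
round(prop.table(table(Boston$crime_cat[train_idx])), 3)  # each class ~0.25
```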

Linear Discriminant Analysis (LDA)

# Loading the MASS package for LDA
library(MASS)

# Fitting the LDA model using the categorical crime rate as the target variable
lda_fit <- lda(crime_cat ~ ., data = train_set)

# Summarizing the LDA fit
lda_fit
## Call:
## lda(crime_cat ~ ., data = train_set)
## 
## Prior probabilities of groups:
## [0.00632,0.082]   (0.082,0.257]    (0.257,3.68]       (3.68,89] 
##       0.2512315       0.2487685       0.2487685       0.2512315 
## 
## Group means:
##                        zn     indus       chas       nox       rm      age
## [0.00632,0.082] 31.044118  4.910588 0.03921569 0.4563225 6.603618 44.69118
## (0.082,0.257]    9.504950  8.956832 0.04950495 0.4898604 6.178069 59.35446
## (0.257,3.68]     1.980198 12.547822 0.10891089 0.6002574 6.343000 82.35050
## (3.68,89]        0.000000 18.114510 0.06862745 0.6738431 5.977176 91.35392
##                      dis       rad      tax  ptratio    black     lstat
## [0.00632,0.082] 5.567500  3.480392 282.7941 17.52941 390.6793  7.254804
## (0.082,0.257]   4.503741  4.732673 331.4554 18.28317 385.5904 11.834950
## (0.257,3.68]    2.906778  5.752475 356.7624 17.86535 366.9355 12.780990
## (3.68,89]       2.000806 23.813725 663.4216 20.14608 282.7808 19.318235
##                     medv
## [0.00632,0.082] 27.27451
## (0.082,0.257]   22.51485
## (0.257,3.68]    24.09703
## (3.68,89]       15.99314
## 
## Coefficients of linear discriminants:
##                   LD1           LD2           LD3
## zn       0.0031884105  0.0208440576  -0.041929772
## indus    0.0109161837 -0.0401399251   0.029229416
## chas    -0.3279736950 -0.1285775950   0.336570696
## nox      2.7906127427 -6.4227737693 -11.173295057
## rm      -0.1669832417 -0.0539775098  -0.269691841
## age      0.0111451124 -0.0174749975  -0.003479419
## dis      0.0007779010 -0.0721657174   0.063276438
## rad      0.3880996136  0.0975266139  -0.033356613
## tax      0.0001575583  0.0009460959   0.004446496
## ptratio  0.0413786927  0.0078736414  -0.131143031
## black   -0.0012901167  0.0005826151   0.001713432
## lstat    0.0224876499 -0.0198067060   0.062101434
## medv     0.0174902892 -0.0368794675  -0.008922815
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9546 0.0351 0.0103
# Plotting the LDA model
plot(lda_fit)

Predictions and Cross-Tabulation using LDA

# Saving the actual crime categories from the test set
actual_crime_categories <- test_set$crime_cat

# Removing the categorical crime variable from the test set
test_set <- test_set[, -which(names(test_set) == "crime_cat")]

# Using the LDA model to predict crime categories on the test set
predicted_crime_categories <- predict(lda_fit, newdata = test_set)$class

# Cross-tabulating the predicted vs actual crime categories
confusion_matrix <- table(Predicted = predicted_crime_categories, Actual = actual_crime_categories)

# Printing the confusion matrix
confusion_matrix
##                  Actual
## Predicted         [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
##   [0.00632,0.082]              18             5            2         0
##   (0.082,0.257]                 6            14            7         0
##   (0.257,3.68]                  1             6           14         0
##   (3.68,89]                     0             0            2        25

Based on the confusion matrix above (rows are predicted categories, columns are actual), the classifier performs as follows:

  • Low Crime Rate ([0.00632, 0.082]): 18 of the 25 actual low-crime suburbs were predicted correctly; 6 were misclassified as medium-low ((0.082, 0.257]) and 1 as medium ((0.257, 3.68]).
  • Medium-Low Crime Rate ((0.082, 0.257]): 14 of 25 were predicted correctly, while 5 were predicted as low and 6 as medium.
  • Medium Crime Rate ((0.257, 3.68]): 14 of 25 were predicted correctly; 2 were predicted as low, 7 as medium-low, and 2 as high ((3.68, 89]).
  • High Crime Rate ((3.68, 89]): All 25 actual high-crime suburbs were correctly identified.

Key Observations from the results:

  • The model is particularly effective at identifying the high crime rate category, recovering all 25 high-crime suburbs.
  • There is some confusion between the low and medium-low crime rate categories, as well as between the medium-low and medium categories.
  • The model misses no truly high-crime suburb, although it does flag 2 medium-crime suburbs as high crime; this trade-off may be acceptable if the goal is to never overlook high-crime areas.
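
To summarize the matrix with a couple of numbers, we can compute overall accuracy and per-class recall. The snippet below reconstructs the printed counts so it stands alone; in the report itself, the existing `confusion_matrix` object could be used directly:

```r
# Reconstructing the printed confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(18, 6, 1, 0,    # actual [0.00632,0.082]
               5, 14, 6, 0,    # actual (0.082,0.257]
               2, 7, 14, 2,    # actual (0.257,3.68]
               0, 0, 0, 25),   # actual (3.68,89]
             nrow = 4)

accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # (18 + 14 + 14 + 25) / 100 = 0.71

recall <- diag(cm) / colSums(cm)  # fraction of each actual class recovered
round(recall, 2)                  # 0.72, 0.56, 0.56, 1.00
```

The per-class recalls make the pattern in the bullets explicit: perfect recovery of the high-crime class, weaker separation among the middle two categories.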

K-Means Clustering

Next, I’m going to perform k-means clustering on the standardized Boston dataset to identify clusters within the data.

# Reload the Boston dataset
data("Boston")

# Standardizing the dataset
scaled_Boston <- scale(Boston)

# Calculating the distances between observations
distances <- dist(scaled_Boston)

# Installing and loading the factoextra package
if (!require(factoextra)) {
  install.packages("factoextra")
  library(factoextra)
}
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determining the optimal number of clusters using the elbow method
set.seed(123)
fviz_nbclust(scaled_Boston, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

# Running k-means with the determined optimal number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)

# Creating a data frame for plotting
plot_data <- as.data.frame(scaled_Boston)
plot_data$cluster <- factor(kmeans_result$cluster)

# Visualizing the clusters using two variables from the dataset
ggplot(plot_data, aes(x = rm, y = lstat)) +
  geom_point(aes(color = cluster)) +
  labs(color = 'Cluster')

  1. Cluster Distribution: The plot shows how the observations are grouped into four clusters, each represented by a different color.
  2. Cluster Characteristics:
  • Cluster 1: Lower rm values and higher lstat values, indicating houses with fewer rooms in areas with a higher proportion of lower-status population.
  • Cluster 2: Moderate rm and lstat values, suggesting houses with an average number of rooms and a moderate lower-status population.
  • Cluster 3: Higher rm values and moderate to low lstat values, indicating larger houses with a lower proportion of lower-status population.
  • Cluster 4: Moderate to high rm values but very low lstat values, suggesting these areas have larger houses and very low lower-status population proportions.
  3. Correlation Inference: There appears to be a negative correlation between rm and lstat, as expected; areas with more rooms tend to have a lower percentage of lower-status population.
  4. Cluster Separation: The separation between clusters indicates how distinct the groups are based on the two variables used. For example, Cluster 4 is well separated from the others, suggesting that areas with larger houses and very low lower-status proportions are quite distinct from other areas.
  5. Outliers: Points far from their cluster centers may be outliers. For instance, points in Cluster 1 that fall deep into the region of Cluster 3 would be unusual observations worth further investigation.
  6. Potential for Further Analysis: The clustering suggests underlying patterns in the Boston housing data related to house size and socio-economic status. These patterns could be explored further with additional socio-economic variables, or by looking at how the clusters relate to other outcomes like median home values.
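
The cluster characterizations above can be verified by mapping the standardized cluster centers back to the original units; `scale()` stores the column means and standard deviations as attributes, which makes the back-transformation straightforward. A verification sketch (my addition; the setup lines re-create the clustering so the snippet is self-contained, and the cluster numbering may differ across R versions even with the same seed):

```r
library(MASS)
data(Boston)
scaled_Boston <- scale(Boston)
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)

# Undo the standardization: x = z * sd + mean, using the attributes scale() kept
centers_original <- t(apply(kmeans_result$centers, 1, function(z)
  z * attr(scaled_Boston, "scaled:scale") + attr(scaled_Boston, "scaled:center")))
round(centers_original[, c("rm", "lstat")], 2)
```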

Bonus

library(MASS)
data("Boston")

# Standardizing the dataset
scaled_Boston <- scale(Boston)

Next, I perform the k-means clustering:

set.seed(123) # for reproducibility
# According to the assignment, a reasonable number of clusters is > 2; here I choose 4
kmeans_result <- kmeans(scaled_Boston, centers = 4)

Now, I perform LDA using the clusters as target classes:

# Adding the cluster assignments as a factor to the Boston data
Boston$cluster <- factor(kmeans_result$cluster)

# Fitting LDA model using the clusters as target classes
library(MASS) # for LDA
lda_fit <- lda(cluster ~ ., data = Boston)

Finally, I visualize the results by plotting the discriminants:

# Plotting the LDA discriminants (note: plot.lda() does not draw variable arrows like a true biplot)
plot(lda_fit)

# Examining the model's coefficients
lda_fit$scaling
##                  LD1          LD2          LD3
## crim    -0.033252529  0.093706888 -0.003875224
## zn      -0.003852921  0.005772233 -0.047264096
## indus   -0.107232590 -0.066745602 -0.041117990
## chas     0.068548580 -0.197356300  0.410601305
## nox     -7.766232173 -5.409634182 -7.673642525
## rm       0.139330592  0.341290080 -0.794324360
## age      0.003903578 -0.004331691  0.015529903
## dis     -0.001695810 -0.034881757 -0.176213368
## rad     -0.063928129 -0.015823030 -0.007533964
## tax     -0.004651926 -0.002905580 -0.003796952
## ptratio -0.159544007 -0.041917959 -0.038208490
## black    0.008358645 -0.020444688 -0.004007031
## lstat   -0.028069814  0.013570317 -0.056673688
## medv     0.005932882  0.013682649 -0.086102624

The LDA scaling coefficients suggest that nox (nitric oxide concentration) dominates the first discriminant (LD1), rm (average number of rooms) carries the most weight on the second (LD2), and chas (proximity to the Charles River) on the third (LD3). These variables therefore appear to be the most important linear separators for the clusters. A caveat: because this LDA was fit on unstandardized data, variables measured on small scales (such as nox, which ranges from about 0.39 to 0.87) naturally receive large raw coefficients, so these weights should not be read as a direct ranking of importance.
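
One common way to make the weights comparable across variables is to multiply each raw coefficient by its variable's standard deviation, which approximates the coefficients that would result from fitting on standardized predictors. A sketch of this adjustment (my addition; the setup lines re-create `lda_fit` so the snippet is self-contained):

```r
library(MASS)
data(Boston)
set.seed(123)
Boston$cluster <- factor(kmeans(scale(Boston), centers = 4)$cluster)
lda_fit <- lda(cluster ~ ., data = Boston)

# Scale-adjusted weights: coefficient * SD of the predictor, per discriminant.
# The length-14 sds vector recycles down each column of the 14x3 scaling matrix.
sds <- apply(Boston[, setdiff(names(Boston), "cluster")], 2, sd)
round(lda_fit$scaling * sds, 3)
```

After this adjustment, nox's apparent dominance shrinks considerably, since its raw coefficient was inflated by its small scale.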

Super-Bonus

Now, the goal is to project the data used to fit the LDA model with the model’s scaling matrix and then visualize the projections in 3D.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# 'lda_fit' is the LDA model fitted above; dropping the cluster labels leaves the predictors
model_predictors <- Boston[, -which(names(Boston) == "cluster")]

# Checking the dimensions
dim(model_predictors)
## [1] 506  14
dim(lda_fit$scaling)
## [1] 14  3
# Matrix multiplication to get the projections
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)

Now, onto the 3D visualization…

# Installing and loading the plotly package
if (!require(plotly)) {
  install.packages("plotly")
  library(plotly)
}
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Creating a 3D scatter plot using the LDA projections
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = Boston$cluster) %>%
  layout(scene = list(xaxis = list(title = 'LD1'),
                      yaxis = list(title = 'LD2'),
                      zaxis = list(title = 'LD3')))

Interpretation:

  • Cluster Delineation: The plot suggests that the clusters have distinct regions in the multidimensional space defined by the LDA, although there is some overlap, especially between clusters 1 and 2. The distinctness of these clusters in the 3D space confirms the separation achieved by the LDA.
  • Dimensionality Reduction: LDA reduces the data to three discriminants, the maximum possible for four classes, so these three dimensions capture all of the between-cluster variance.
  • Cluster Characteristics: Clusters 3 and 4 appear to be more spread along the LD2 and LD3 axes, while clusters 1 and 2 are more compact. The spread could indicate variability within the clusters concerning the underlying variables.
  • Influential Variables: While the plot doesn’t directly show the contribution of each variable, the spread and orientation of the clusters can be partially attributed to the most influential variables identified previously, such as nox, rm, and chas.
  • Comparison with Crime Classes: If compared to a similar plot colored by crime classes, one might observe whether clusters defined by socio-economic factors (like crime rate) align with those determined through unsupervised k-means clustering. Similarities might suggest a relationship between crime rates and the variables used for clustering, while differences could indicate that the clustering captures other aspects of the data.